Abstract

A community in a social network is usually understood to be a group of nodes more densely
connected with each other than with the rest of the network. This is an important concept
in most domains where networks arise: social, technological, biological, etc. For many years
algorithms for finding communities implicitly assumed communities are nonoverlapping (leading
to use of clustering-based approaches) but there is increasing interest in finding overlapping
communities. A barrier to finding communities is that the solution concept is often defined in
terms of an NP-complete problem such as Clique or Hierarchical Clustering.

This paper seeks to initiate a rigorous approach to the problem of finding overlapping
communities, where "rigorous" means that we clearly state the following: (a) the object sought by
our algorithm (b) the assumptions about the underlying network (c) the (worst-case) running
time.

Our assumptions about the network lie between worst-case and average-case. An average- 
case analysis would require a precise probabilistic model of the network, on which there is 
currently no consensus. However, some plausible assumptions about network parameters can be 
gleaned from a long body of work in the sociology community spanning five decades focusing on 
the study of individual communities and ego-centric networks (in graph theoretic terms, this is 
the subgraph induced on a node's neighborhood). Thus our assumptions are somewhat "local" 
in nature. Nevertheless they suffice to permit a rigorous analysis of running time of algorithms 
that recover global structure. 

Our algorithms use random sampling similar to that in property testing and algorithms for
dense graphs. We note however that our networks are not necessarily dense graphs, not even in
local neighborhoods.

Our algorithms explore a local-global relationship between ego-centric and socio-centric net- 
works that we hope will provide a fruitful framework for future work both in computer science 
and sociology. 
1 Introduction



Community structure is an important characteristic of social networks and has long been studied 
in sociology. The classic paper of Luce and Perry in 1949 — which introduced the term "Clique" to 
graph theory — described a community as subsets of individuals every pair of whom are acquainted. 
The text of Scott [32] equates communities with objects such as cliques or other dense subgraphs. 
Another seminal 1974 paper, Breiger [7] develops a theory of communities in terms of affiliation 
networks, which in graph theoretic terms consist of using a bipartite graph with people on one 
side and communities on the other. In sociology today, the answer to many natural and important 
questions depends on a better understanding of community structure. Can you travel from one 
node to another random node using only a few "strong ties" [19]? Do networks contain "wide" 
bridges [10]? How much do communities overlap? 

The problem of identifying communities arose independently in other fields such as internet 
search, study of the web graph, and the problem of clustering network nodes (in networks of bio- 
logical interactions, citations, etc.). In his recent comprehensive survey of algorithmic approaches, 
Fortunato [31] divides them into two camps based upon whether or not the algorithm assumes 
— implicitly or explicitly — that communities are disjoint. Assuming disjointness implicitly leads to 
a view of a community as a nonexpanding node set: it contains many edges but has relatively few 
edges leaving it ^. This viewpoint suggests many approaches that have been tried: graph partition- 
ing, hierarchical clustering, spectral clustering, simulated annealing, modularity, betweenness, etc. 
Gibson et al. [16] discovered interesting communities via hubs and authorities; Hopcroft et al. [22] 
used agglomerative clustering on the Citeseer database and exhibited interesting communities
that persist over time. 

However, Leskovec et al. [26] recently presented an extensive study of many of the above methods
on larger datasets, and questioned whether they uncover meaningful structure at larger scales.
Leskovec et al. often detect a large "core" in the network that is difficult to break into communities. 
One possible interpretation is that if there are communities in the core, they must overlap. 

Thus there is growing interest in finding communities that are allowed to overlap, as they do 
in most real-life social networks. When communities overlap, each community will not in general 
be a nonexpanding set. (Consequently, clustering-based approaches may not work.) For instance, 
imagine that the network contains many communities that are equal-sized cliques with bounded 
pairwise intersections, and every person belongs to four communities. Then each community/clique
will have in general as much as three times as many edges going out of it as are contained in it. 

Approaches for finding overlapping communities involve either heuristic clique-finding or local-search
procedures that maintain overlapping clusters and improve them via a series of heuristics.
Sometimes a probabilistic generative model is assumed and a max-likelihood fit is attempted via 
EM and other ideas. (The very recent survey of Xie et al. [37] evaluates dozens of competing 
heuristics introduced in the last two years alone.) However, Fortunato states at the end of his 
100-page survey: 

..research on graph clustering has not yet given a satisfactory solution of the problem and leaves 
us with a number of important open issues. The field has grown in a rather chaotic way, without a 
precise direction or guidelines. . . What the field lacks the most is a theoretical framework. . . everybody 
has his or her own idea of what a community is. 

^The Girvan-Newman [17] algorithm does not explicitly define communities as nonexpanding sets. Instead it
defines the betweenness of an edge e as the fraction of node pairs v, w whose shortest path passes through e. It iteratively
removes edges of high betweenness to isolate communities.
Modeling communities. Of course, it is entirely possible that Fortunato's questions have no
clean answer, or at least one that spans all types of networks of interest. Quite possibly, a commu- 
nity in a network of gene-gene interactions is an inherently different object than one in the graph 
of Facebook friendships. Furthermore, a clear definition of the problem does not in itself guarantee
a simple algorithm — e.g., if the definition involves cliques. 

A related issue is development of models for community formation/growth whose mathematical 
analysis yields predictions testable on real-life networks. An inspiration here is the large body of 
Barabasi-Albert [5] style models which make predictions about degree distributions, graph distance
etc. One concrete attempt to model community formation is Lattanzi and Sivakumar's [25] affili- 
ation networks model that is inspired by sociology work. In this dynamic model the communities 
are cliques (or dense subgraphs), and the mode of network growth is the addition of either new individuals
(i.e., nodes) or new communities (i.e., cliques) to an affiliation network. New individuals
partially copy the community memberships of existing individuals, and new communities are offshoots
(subsets) of existing communities. Additional generative models with community structure
appear in [28, 1, 6, 24]. 

What Fortunato's questions point to, though, is a seeming chicken-and-egg situation. Develop- 
ing models requires reliable data about community structure in real networks. Conversely, finding 
reliable data about community structure requires some implicit model, since without such a model 
the algorithm is consigned to solving worst-case instances of NP-hard problems like Clique, Dense 
Subgraph, and Small-set Expansion. (This issue clearly does not arise for simpler graph properties 
like node degrees.) 

This paper. We seek a more rigorous approach to the problem of finding communities, where 
"rigorous" means that we clearly state the following: (a) the object sought by our algorithm (b) 
the assumptions about the underlying network (c) the (worst-case) running time. We try to break 
out of the chicken-and-egg situation as follows. Instead of proposing a generative model per se, 
we list fairly minimal assumptions about the network that are based on theoretical and empirical 
work in sociology. Moreover, these assumptions are "local" in nature and they largely depend upon 
objects well-studied in sociology, namely ego-centric networks and individual communities. We 
think these assumptions will be satisfied by many plausible generative models (including
Lattanzi-Sivakumar, according to our simulations [30]). Thus it is interesting that these suffice to recover
the communities. 

Since our formalization of (a) and (b) draws on sociology we expect our approach to apply 
more to, say, the Facebook graph than biological networks. Furthermore, since our algorithms 
involve random sampling in node neighborhoods, they may mesh well with a dominant approach for 
network study in sociology, namely, ego-centric analysis [34]. An ego-centric network [32] consists 
of a person (the ego) and his ties (called "alters"). Ego-centric networks and their structure have 
been extensively studied in sociology (often via questionnaires and field-study) [34] as a way to 
gain insight into the entire network [33, 8, 9, 13]. They give a view of how people develop and 
manage their social network resources [27, 33]. Data about such networks is easy to collect, even 
in the field, far from computers [21]. They are even included in the biennial General Social Survey,
a central resource in sociology.

Sociological Foundations for our Assumptions. Sociologists have observed that ego-centric 
networks can be clustered into a few communities, from which they infer that individuals participate 
in only a small number of communities [35, 20, 8, 9]. Furthermore, they have observed that a 
large portion of a person's ties fit into communities [35, 20]. The celebrated theory of Feld [12] 
gives a theoretical understanding for this fact based upon "foci." He defines a focus as "a social,
psychological, legal, or physical entity around which joint activities are organized (e.g., workplaces, 
voluntary organizations, hangouts, families, etc.)" and his theory says that they are responsible for 
creating many ties in the network. 

Mathematically, one could say that each individual participates in up to $d$ communities, and
these communities explain a $\gamma$ fraction of his/her ties. This information already may greatly help
the algorithm, whose running time may depend upon $d$ and $\gamma$.
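
As a rough illustration of these two parameters, the sketch below computes $d$ and $\gamma$ for a small graph whose communities are already known. The function names, the dictionary-of-sets graph representation, and the toy network are assumptions made for this example only; they are not part of the paper.

```python
from typing import Dict, List, Set, Tuple


def build_adj(edges: List[Tuple[int, int]]) -> Dict[int, Set[int]]:
    """Build an undirected adjacency map (node -> set of neighbours)."""
    adj: Dict[int, Set[int]] = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def membership_stats(adj: Dict[int, Set[int]],
                     communities: List[Set[int]]) -> Tuple[int, float]:
    """Return (d, gamma) for a graph with known communities.

    d     = largest number of communities any single node belongs to.
    gamma = smallest fraction, over all nodes, of a node's ties that go to
            someone sharing at least one community with it.
    """
    member_of = {u: [c for c in communities if u in c] for u in adj}
    d = max(len(cs) for cs in member_of.values())
    gamma = 1.0
    for u, nbrs in adj.items():
        if not nbrs:
            continue  # an isolated node has no ties to explain
        explained = sum(1 for w in nbrs if any(w in c for c in member_of[u]))
        gamma = min(gamma, explained / len(nbrs))
    return d, gamma


# Hypothetical toy network: communities {0,1,2} and {2,3}, plus one
# ambient (non-community) edge between nodes 1 and 3.
toy_adj = build_adj([(0, 1), (0, 2), (1, 2), (2, 3), (1, 3)])
toy_communities = [{0, 1, 2}, {2, 3}]
d, gamma = membership_stats(toy_adj, toy_communities)
```

In the toy network node 2 sits in both communities, and the single ambient edge lowers the explained fraction for nodes 1 and 3.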

What is a community? As mentioned, in sociology communities are thought of as either cliques 
or as dense subgraphs which are "relatively tightly connected" together compared with the rest 
of the network (see chapter 6 of [32] for an introduction and survey of how sociology models 
communities). Sometimes variants are considered, e.g., Alba [2] considers cliques of the $t$-th power
of the graph (i.e., edges correspond to having distance $\le t$ in the original graph). Jackson's text [23]
allows communities to be dense graphs and describes various models for how edges are generated
within a community: e.g., $G(n,p)$ or the expected degree model.

Furthermore, not all dense graphs would pass muster as communities [32, 15]. For example, a 
union of two disjoint cliques of size $t$ is a fairly dense graph of size $2t$ but it is not a community.
We will assume that a "community" is a dense subgraph with edges inside it generated according to 
the expected degree model. (While this assumption simplifies the exposition, in Section 4.1 we will 
observe that this assumption can be relaxed to deal with other families of dense graphs.) However, 
we leave it as future work to extend our notions to more hierarchical notions of communities such 
as in [36]. 

Another principle often used in sociology is maximality: we should not be able to add nodes to 
the community and get the same structure [15], otherwise these nodes should be considered part of 
the community. (This was the basis for the maximal clique problem introduced in Luce and Perry's 
1949 paper; see also the text of Scott [32].) Thus nodes within the community are (in some way)
better attached to the community than nodes outside of the community. 

1.1 Formal Assumptions and Statement of Results 

Our assumptions are grounded in the above observations. The network is a graph of size $n$. Each
edge $(u, v)$ has a probability $e_{u,v}$ with which it is picked. This can even be 1, so we allow adversarial
edges. Each community $C$ is an arbitrary subset of nodes (unknown to the algorithm), and each
node is in at most $d$ communities, so that the communities are allowed to overlap. We think of $d$ as
constant or small (though several of our algorithms run in polynomial time even for $d = 2^{\sqrt{\log n}}$).
Any edge $(u, v)$ where $u$ and $v$ share a community $C$ is a community edge. The remaining edges
we call ambient edges.

Assumption 1) Community edges are chosen according to the expected degree model. 

Each node $u$ in $C$ has an affinity $p_u$ that lies in $[0, 1]$. (Node $u$ has a different affinity for
each community it belongs to.) Two nodes $u, v \in C$ are connected by an edge with probability
$p_u p_v$ (this is called the expected degree model in the standard text Jackson [23]). Notice that the
model is sufficiently flexible to include other well-studied cases: the subcase $p_u = 1$ for all $u \in C$
corresponds to "C is a clique," and the subcase $p_u = \sqrt{\alpha}$ for all $u \in C$ corresponds to "C is a dense
subgraph generated according to the random graph model $G(k, \alpha)$." We will usually assume $p_u$ is
lowerbounded by some constant, so that the community is always a fairly dense graph. However in
Section 5 we show conditions under which our algorithms can handle even sparser communities.

For nodes $u, v$ that share more than one community, the probability that they are connected
is at least the maximum of $p_u p_v$ over all the communities they share.
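
A minimal sketch of how community edges could be drawn under Assumption 1 is given below. It assumes the edge probability for a pair that shares several communities is exactly the maximum of $p_u p_v$ (the text only requires "at least"), and it ignores ambient edges entirely. The function name, the `(community index, node)` keying of affinities, and the toy values are illustrative assumptions.

```python
import random
from itertools import combinations


def sample_community_edges(communities, affinity, rng):
    """Draw community edges under the expected degree model.

    communities: list of node lists.
    affinity:    dict mapping (community index, node) -> p in [0, 1].
    rng:         a random.Random instance.
    Returns (prob, edges): the per-pair probability used and the sampled edges.
    Assumption: a pair in several communities gets exactly max p_u * p_v.
    """
    prob = {}
    for i, members in enumerate(communities):
        for u, v in combinations(sorted(members), 2):
            p = affinity[(i, u)] * affinity[(i, v)]
            prob[(u, v)] = max(prob.get((u, v), 0.0), p)
    # Iterate in sorted order so a seeded rng gives reproducible samples.
    edges = {pair for pair in sorted(prob) if rng.random() < prob[pair]}
    return prob, edges


# Hypothetical toy input: nodes 1 and 2 share both communities.
communities = [[0, 1, 2], [1, 2, 3]]
affinity = {(0, 0): 1.0, (0, 1): 0.5, (0, 2): 0.5,
            (1, 1): 1.0, (1, 2): 1.0, (1, 3): 0.0}
prob, edges = sample_community_edges(communities, affinity, random.Random(0))
```

Setting every affinity to 1 turns each community into a clique, and setting every affinity to $\sqrt{\alpha}$ gives the $G(k, \alpha)$ subcase mentioned above.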

While Assumption 1 seems to be tending towards a "generative model," we use this formulation 
primarily for ease of presentation. We remark in Section 4.1 that our algorithms work so long as
the edges are "well-distributed." 

Assumption 2) Maximality assumption with gap $\epsilon$ (also called "Gap Assumption").

Nodes outside the community $C$ are less strongly connected to it than community nodes are.
For example, if one posits that $p_u = \sqrt{\alpha}$ for $u \in C$ (i.e., each $u$ has edges to about an $\alpha$ fraction of
community nodes) then our assumption would say that each $w \notin C$ has edges to less than an $\alpha - \epsilon$ fraction of
nodes in $C$. This seems reasonable since otherwise one should consider whether $w$ should belong
to $C$ as well. Of course, such assumptions about maximality are standard, though the gap $\epsilon$ in this
context is new. We are able to relax the Gap Assumption in certain instances (see Section 4.2),
though the results are no longer as clean and crisp.

Assumption 3) Community membership accounts for a significant portion of each node's edges, 
say a constant fraction $\gamma > 0$.

Surprisingly, Assumptions 1-3 suffice to let us efficiently recover the communities even though 
we made no other assumptions about the ambient edges, and allow arbitrary affinities. 

Informal Theorem 1 (See Theorem 5) If all node affinities are lowerbounded by $\sqrt{\alpha}$ then the
communities can be recovered in $n^{C \log k} d$ time, where $C = C_{\alpha,\gamma,\epsilon}$ and $k$ is the size of the largest
community.

Unfortunately, this running time is only "quasipolynomial" instead of polynomial, since $\log k$
could be as high as $\log n$. We can get polynomial and even near-linear time algorithms for more
restricted versions of the problem. The paper contains many such theorems and the following is 
representative. 

Informal Theorem 2 (See Section 2) For all constants $\alpha, \delta > 0$, if all affinities are $\sqrt{\alpha}$ (note:
this includes cliques as a special case), and all community sizes are within a $1/\delta$ factor of each
other, then the communities can be recovered in time $O(nkd^{C \log d})$ where $C = C_{\alpha,\delta,\gamma,\epsilon}$. Moreover,
if affinities are only guaranteed to be at least $\sqrt{\alpha}$, the communities can be recovered in time
$O(nk^{2\log(10/\epsilon)/\alpha + 1} d^{C \log d})$.

Our approach makes heavy use of the Gap Assumption and uses random sampling coupled 
with some exhaustive enumeration of small cliques in the subgraph induced on the sample. Similar 
ideas are well-known in property testing [18] and the related field of approximation algorithms for 
NP-hard problems in dense graphs [3, 14]. Note however that our graphs are not dense. If $d = O(1)$
then the induced graph in the neighborhood of each node is dense, though we do allow $d$ to be
as large as $2^{\sqrt{\log n}}$ in many settings. Even when $d = O(1)$ we do not know how to use, for example,
the weak regularity lemma [14] (a standard tool in those other fields) to recover the communities.

The use of local sampling in the neighborhood of a single node gives our algorithms the feel 
of ego-centric analysis in sociology. However, when we examine communities as dense subgraphs, 
our algorithms (partially) explore a two-hop neighborhood from a starting node, which is a more 
generalized ego-centric analysis, and is necessary because no one node has ties to the entire commu- 
nity. We also adapt our ideas to the case where communities are not dense graphs (as is plausible 
in really large networks). Though our results are preliminary, that algorithm explores a two-hop
neighborhood from a starting node. 

Paper Organization. Section 2 presents algorithms for the case when all community sizes are 
within an $O(1)$ factor of each other. These algorithms are quite efficient and are a good introduction
to our techniques. Section 3 allows communities to have vastly different sizes, derives the most 
general result (Theorem 5, our Informal Theorem 1), and then studies how to derive more efficient 
algorithms for specialized cases or under additional assumptions. 
Section 4 shows how in certain cases some of the Assumptions 1-3 can be relaxed. Here there are
many open problems, which are also discussed in Section 6. Section 5 shows (under some stronger 
assumptions) how to handle the case where each community is a sparse random graph. 

1.2 Related Work 

The above setting seems superficially similar to other planted graph problems that were successfully 
treated using SVD (singular value decomposition) (see McSherry [28] and others). This similarity is 
however illusory because, first, the non-community edges in our model are not necessarily randomly 
distributed, and more seriously, because the SVD techniques are known for finding vertex partitions 
whereas here due to overlap between communities we need to find edge partitions. 

Eppstein, Löffler, and Strash [11] show how to provably find all maximal cliques in time that is
exponential in the "degeneracy" of the network, which is bounded by the maximum degree in the
worst case. Our network model, however, allows graphs with arbitrary degeneracy and also works
for concepts of community more general than a clique. 

Mishra et al. [29] also study overlapping communities in social networks. They show a simple 
and elegant algorithm for detecting overlapping communities in a certain parameter space. Their 
algorithm works best for communities where the overlap is not too large. In our parameter space, 
to detect a community $C$ with density $\alpha$ and gap $\epsilon$, they require that some $v \in C$ has fewer than
$(\alpha + \epsilon - 1)|C|$ neighbors outside of $C$. This is a strong restriction on the amount of overlap,
and is impossible for small $\epsilon$ when $\alpha$ is bounded away from 1.

Related independent work: Balcan et al. [4] have independently studied the problem of
inferring overlapping communities. They have a very different starting point in terms of an explanatory
model of how network ties are formed via a preference/ranking function among the individuals.
Surprisingly, they ultimately arrive at a very similar set of "minimal" assumptions and algorithms for
inferring the communities. Perhaps this convergence is some kind of validation of both approaches.

2 When Communities have Similar Sizes 

This section will give very efficient algorithms to find all communities in the graph when the 
following assumption holds: 

Assumption: Each community has size between $\delta k$ and $k$ where $\delta > 0$ is some constant and $k$
is arbitrary but known to the algorithm.

We continue to make the three assumptions made in Section 1.1, and the parameters $n, \gamma, d, \epsilon$
are as defined there. To emphasize, the communities can be arbitrary sets in the graph so long as 
the gap assumption is not violated and each node is in at most d communities. Furthermore, the 
placement of "ambient" (i.e., non-community) edges in the graph can be adversarial as long as it 
does not violate the assumptions. 

The running time is exponential in $1/\delta$, so one would not use these algorithms if communities
have radically different sizes; that case is handled in Section 3. 

2.1 Warmup: Communities are Cliques 

In this section, "community" is understood to be synonymous with "clique" which corresponds to 
all affinity values $p_u = 1$.

The algorithm focuses on the neighborhood $\Gamma(v)$ of a node $v$, and takes a random sample of
nodes $S$ from it. Then it uses brute force (or any other suitable heuristics) to find cliques of size
about $\log d$ in the graph induced on $S$, and tries to extend them to communities. (To use sociology
terms, here egocentric or node-based analysis leads to provably correct socio-centric or societal
analysis.) The running time is linear in $n \cdot k$, albeit with a big "constant" factor term dependent
upon $d, \delta, \epsilon, \gamma$.

Theorem 1 Given a graph satisfying the assumptions in this section, the Clique-Community- 
Find- Algorithm outputs each community with probability at least 2/3 in time ^ 0{nk/6^)2'-'^'^°^ 

The intuition behind why sampling works is that for any node $v$ the neighborhood $\Gamma(v)$ has size
at most $kd/\gamma$: $v$ lies in at most $d$ communities of size at most $k$, and these account for at least
a $\gamma$ fraction of its ties. Each community $C$ containing $v$ lies within
$\Gamma(v)$ and has size at least $\delta k$, which is at least a $\delta\gamma/d$ fraction of $\Gamma(v)$. Thus a random sample $S$ of
$\Gamma(v)$ will have many representatives from $C$. The only subtlety is to watch out for "false positives":
sets that are not cliques but may present themselves as such during sampling.

Clique-Community-Find-Algorithm
1: Pick "starting nodes" uniformly at random and do the following:
2: For each starting node $v$ randomly sample $S \subseteq \Gamma(v)$ by including each node with probability
$p = \log(12d/\epsilon\delta\gamma)/(\delta\epsilon k)$. Proceed only if the sample has size at most three times the expectation
$\deg(v)\log(12d/\epsilon\delta\gamma)/(\delta\epsilon k) \le d\log(12d/\epsilon\delta\gamma)/(\gamma\delta\epsilon)$.
3: for all cliques $U$ of size at most $2pk$ in the induced graph $G(S)$ on $S$ do
4: Let $V'$ be the set of nodes in $\Gamma(v)$ which are connected to all nodes in $U$, and let $G(V')$ be
the induced subgraph on $V'$
5: Let $U'$ be the set of nodes in $G(V')$ whose degree in this subgraph is at least $(1 - \epsilon/2)|V'|$.
Output $U'$ if it is a clique of size at least $\delta k$, and for all $v \notin U'$, $v$ is connected to at most a
$1 - \epsilon$ fraction of $U'$.
6: end for
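
The steps above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the starting nodes are passed in by the caller, $\Gamma(v)$ is taken to include $v$ itself, a node counts itself when its degree inside $G(V')$ is compared with the threshold, and $p$ is capped at 1 so the toy input is sampled in full. All function names and the toy graph are hypothetical.

```python
import math
import random
from itertools import combinations


def build_adj(edges):
    """Undirected adjacency map (node -> set of neighbours) from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def is_clique(adj, nodes):
    """True if every pair of the given nodes is adjacent."""
    return all(b in adj[a] for a, b in combinations(nodes, 2))


def extend_seed(adj, v, seed, eps, min_size):
    """Steps 4-5: try to grow one candidate clique `seed` into a community."""
    nbhd = adj[v] | {v}  # assumption: Gamma(v) is taken to include v itself
    # Step 4: V' = nodes of Gamma(v) adjacent to every seed node.
    cand = {w for w in nbhd if all(w == s or s in adj[w] for s in seed)}
    # Step 5: U' = nodes of high degree inside G(V'). A node counts itself
    # (assumption) so that a clique equal to V' passes the threshold.
    core = {w for w in cand if len(adj[w] & cand) + 1 >= (1 - eps / 2) * len(cand)}
    if len(core) < min_size or not is_clique(adj, core):
        return None
    # Gap check: no outside node may see more than a (1 - eps) fraction of U'.
    for w in adj:
        if w not in core and len(adj[w] & core) > (1 - eps) * len(core):
            return None
    return frozenset(core)


def clique_community_find(adj, starts, k, d, delta, gamma, eps, rng):
    """Run steps 2-6 from each node in `starts`; return the communities found."""
    p = min(1.0, math.log(12 * d / (eps * delta * gamma)) / (delta * eps * k))
    max_seed = max(1, math.ceil(2 * p * k))
    found = set()
    for v in starts:
        nbhd = sorted(adj[v] | {v})
        sample = [w for w in nbhd if rng.random() < p]  # step 2
        if len(sample) > 3 * p * len(nbhd):
            continue  # sample too large: skip this starting node
        for size in range(1, min(max_seed, len(sample)) + 1):
            for seed in combinations(sample, size):  # step 3, brute force
                if is_clique(adj, seed):
                    out = extend_seed(adj, v, seed, eps, delta * k)
                    if out is not None:
                        found.add(out)
    return found


# Hypothetical toy input: two 6-cliques that share nodes 4 and 5.
clique_a, clique_b = range(0, 6), range(4, 10)
toy = build_adj(list(combinations(clique_a, 2)) + list(combinations(clique_b, 2)))
result = clique_community_find(toy, starts=[0, 9], k=6, d=2, delta=1.0,
                               gamma=1.0, eps=0.5, rng=random.Random(0))
```

On a graph this small the formula for $p$ exceeds 1, so the sample is the whole neighbourhood and the brute-force enumeration is exhaustive; on realistic inputs the sample, and hence the enumeration, would be far smaller than the neighbourhood.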

Proof: (Theorem 1) For any community $C$ the probability that a randomly chosen starting node
$v$ belongs to it is at least $\delta k/n$. Thus the expected number of times we pick a starting node from
$C$ is at least 9 and so by a Markov bound such a node is selected with probability 1/9. We show
that in each such trial the probability that community $C$ is output is at least 7/9.

So let $v \in C$. Simple calculation based upon Chernoff bounds shows that each of the following
sequence of three statements holds with high probability. (i) The subsampling gives a sample $S$ of
size at most thrice the expectation. (ii) $(1/\epsilon)\log(12d/\epsilon\gamma\delta) < |S \cap C| < 2pk$. Note that $S \cap C$ is a
clique of $G(S)$. Now consider what happens when the for-loop of step 3 tries $U = S \cap C$. Since
every node in $C$ has an edge to every node in $U$, the set $V'$ will contain $C$. (iii) $|V'| < |C| + \epsilon|C|/4$.
This follows since by the Gap Assumption each $u \in \Gamma(v) \setminus C$ has edges to at most a $(1 - \epsilon)$
fraction of nodes in $C$, and thus the probability that it has edges to each node in the random subset
$S \cap C$ is less than $\epsilon$. Also, the size of $\Gamma(v)$ is at most $kd/\gamma$, so in expectation the number of nodes
in $V' \setminus C$ is only $(1 - \epsilon)^{|S \cap C|} \cdot kd/\gamma < \epsilon|C|/12$.

However, if $|V' \setminus C| < \epsilon|C|/4$, then we can identify nodes in $C$ just by their degree in $G(V')$. Each
$w \in C$ has degree at least $|C| > (1 - \epsilon/2)|V'|$; whereas each $w \notin C$ has degree at most
$|C|(1 - \epsilon) + |V' \setminus C| < (1 - 3\epsilon/4)|C| < (1 - \epsilon/2)|V'|$. Thus the algorithm returns exactly $C$. □
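
The expectation bound used for statement (iii) can be checked step by step. The derivation below is a sketch that assumes natural logarithms, the lower bound on $|S \cap C|$ from statement (ii), and $|C| \ge \delta k$ from the size assumption of this section.

```latex
\begin{align*}
(1-\epsilon)^{|S \cap C|}
  &\le e^{-\epsilon |S \cap C|}
   \le e^{-\log(12d/\epsilon\gamma\delta)}
   = \frac{\epsilon\gamma\delta}{12d}, \\
\mathbb{E}\,|V' \setminus C|
  &\le |\Gamma(v)| \cdot (1-\epsilon)^{|S \cap C|}
   \le \frac{kd}{\gamma} \cdot \frac{\epsilon\gamma\delta}{12d}
   = \frac{\epsilon\,\delta k}{12}
   \le \frac{\epsilon\,|C|}{12}.
\end{align*}
```

The first line uses $1 - x \le e^{-x}$; the second multiplies the per-node probability by the bound $kd/\gamma$ on the neighborhood size.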

A practical note: In step 3 we enumerate over all cliques of a certain size. With a slight parameter 
modification we can show it suffices to enumerate over all maximal cliques of this size, for which in 
practice one may be able to use existing heuristic algorithms and reduce running time. 

^In this paper O hides polynomial terms of the parameters $\delta, \gamma, \epsilon$ and also $\beta, \alpha$ if they are relevant.
Namely, in the proof we pick a modified sampling probability $p$ and take $U$ to be $S \cap C$. Then
Chernoff bound and union bound show that the probability that there is a node in $S \setminus C$ that
connects to every node in $S \cap C$ is small. So in this case $S \cap C$ is a maximal clique in $G(S)$.

2.2 When Communities are Dense Subgraphs 

In the previous section, we equated "community" with "clique", as has been done in many previous
works. This assumes that everybody knows everybody else in a "community" — clearly a strong 
assumption as networks get larger (or even in smaller networks where data about adjacencies is 
incomplete). 

In this section, we model a community as a dense subgraph $G(k, \alpha)$ which corresponds to all
affinity values $p_u = \sqrt{\alpha}$. All sets have the same affinity $\sqrt{\alpha}$, though this will be relaxed later. If
nodes $u, v$ are in more than one community, their affinity is still the same and $e_{u,v} = \alpha$.

The description of the algorithm will use the following notion, which every community neces- 
sarily satisfies. 

Definition 2 For $\alpha, \epsilon > 0$ an $(\alpha, \alpha - \epsilon)$-set is a subset of nodes such that 1) every node in the set
has edges to at least an $\alpha$ fraction of nodes in the set; and 2) every outside node has edges to at
most an $\alpha - \epsilon$ fraction of nodes in the set.
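
Definition 2 translates directly into a membership test. The sketch below is one possible reading, assuming fractions are taken over the full size of the set and a node is not counted as its own neighbour; the function names and the toy graph are hypothetical.

```python
from itertools import combinations


def build_adj(edges):
    """Undirected adjacency map (node -> set of neighbours) from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def is_gap_set(adj, nodes, alpha, eps):
    """Check whether `nodes` is an (alpha, alpha - eps)-set in the graph `adj`.

    Assumption: fractions are measured against len(nodes), and a node does
    not count itself, so a clique of size s has inside fraction (s - 1) / s.
    """
    members = set(nodes)
    if not members:
        return False
    for u in adj:
        frac = len(adj[u] & members) / len(members)
        if u in members and frac < alpha:
            return False  # condition 1 fails: an inside node is too weakly tied
        if u not in members and frac > alpha - eps:
            return False  # condition 2 fails: an outside node is too strongly tied
    return True


# Hypothetical toy input: two 6-cliques that share nodes 4 and 5.
toy = build_adj(list(combinations(range(0, 6), 2)) +
                list(combinations(range(4, 10), 2)))
first_clique = set(range(0, 6))
```

In the toy graph each member of the first clique reaches 5/6 of it, while each node outside it reaches only 2/6, so the set passes for `alpha=0.8, eps=0.4` and fails once the gap is widened to `eps=0.5`.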

Theorem 3 Given a graph satisfying the conditions of this section with $k \gg \log n$,
the Dense-Community-Find-Algorithm outputs each community in $\mathcal{C}$ with probability at least
$1 - \exp(-\Omega(\alpha^2\epsilon^2\delta k))$ over the randomness of $G$ and $2/3$ over the randomness of the algorithm in
time $O(n \cdot (k/\gamma\alpha\delta) \cdot 2^{O(\log^2 d)})$.

Proof: We consider the following algorithm:
Dense-Community-Find-Algorithm
1: Randomly choose $100n/\delta k$ nodes as starting nodes, for each starting node $v$ repeat the following
2: Let $G(\Gamma(v))$ be the induced subgraph of $v$ and her neighbors.
3: Subsample this subgraph by including each node with probability $p = 2\log(30d/\alpha\epsilon\delta\gamma)/(\alpha\delta\epsilon^2 k)$. Abort this
round if this graph has size more than three times the expectation.
4: for all sets $U$ of nodes in the subsampled graph of size at most $2pk$ do
5: Let $V'$ be the set of nodes in $\Gamma(v)$ which are connected to at least an $\alpha - \epsilon/2$ fraction of all
nodes in $U$, let $G(V')$ be the induced subgraph on $V'$
6: Let $U'$ be the set of nodes in $G(V')$ such that their degree is at least $(\alpha - \epsilon/2)|V'|$
7: Let $U''$ be the set of nodes that has more than an $\alpha - \epsilon/2$ fraction of edges in to $U'$. Keep
$U''$ if it is an $(\alpha - \epsilon/8, \alpha - 7\epsilon/8)$ set
8: end for
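
Steps 5-7 of the loop body can be sketched for a single candidate set $U$ as shown below. This is an illustrative sketch, not the authors' code: the sampling in steps 1-4 is omitted and the seed is supplied directly, $\Gamma(v)$ is taken to include $v$, and the function names and toy graph (a complete graph on 8 nodes minus a perfect matching, plus four weakly attached outside nodes) are assumptions.

```python
from itertools import combinations


def build_adj(edges):
    """Undirected adjacency map (node -> set of neighbours) from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def is_two_threshold_set(adj, members, inner, outer):
    """Inside nodes reach >= inner fraction of the set, outside nodes <= outer."""
    for u in adj:
        frac = len(adj[u] & members) / len(members)
        if u in members and frac < inner:
            return False
        if u not in members and frac > outer:
            return False
    return True


def dense_extend(adj, v, seed, alpha, eps):
    """Steps 5-7 for one candidate set `seed` drawn from the sample."""
    seed = set(seed)
    if not seed:
        return None
    t = alpha - eps / 2              # the threshold used in all three steps
    nbhd = adj[v] | {v}              # assumption: Gamma(v) includes v itself
    # Step 5: V' = nodes of Gamma(v) tied to at least a t fraction of U.
    v1 = {w for w in nbhd if len(adj[w] & seed) >= t * len(seed)}
    # Step 6: U' = nodes of V' whose degree inside G(V') is at least t |V'|.
    u1 = {w for w in v1 if len(adj[w] & v1) >= t * len(v1)}
    if not u1:
        return None
    # Step 7: U'' = all nodes with more than a t fraction of edges into U'.
    u2 = {w for w in adj if len(adj[w] & u1) > t * len(u1)}
    if u2 and is_two_threshold_set(adj, u2, alpha - eps / 8, alpha - 7 * eps / 8):
        return frozenset(u2)
    return None


# Hypothetical toy input: community 0..7 is complete minus the matching
# (0,1), (2,3), (4,5), (6,7); nodes 8..11 each touch two community nodes.
inside = [(a, b) for a, b in combinations(range(8), 2) if a // 2 != b // 2]
outside = [(8, 0), (8, 2), (9, 1), (9, 3), (10, 4), (10, 6), (11, 5), (11, 7)]
toy = build_adj(inside + outside)
community = dense_extend(toy, 0, {2, 3, 4, 5, 6, 7}, alpha=0.75, eps=0.4)
```

Note that node 1 is not a neighbour of the starting node 0, so it is absent from $V'$ and $U'$ and is only recovered in step 7, which looks at the whole graph rather than just $\Gamma(v)$.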

To analyze the algorithm, we first assume the graph $G$ is "well formed". The graph $G$ is
well formed if 1) the number of edges from any node $v$ to any community $C$ is within $1 \pm \epsilon/8$ of
expectation. 2) For any node $u$, and node $v \in C$ in community $C$, the size of $\Gamma(u) \cap \Gamma(v) \cap C$
is within $1 \pm \epsilon/8$ of expectation. In particular, we know if $u \in C$, $|\Gamma(u) \cap \Gamma(v) \cap C| \ge (\alpha - \epsilon/8)|\Gamma(v) \cap C|$;
if $u \notin C$, $|\Gamma(u) \cap \Gamma(v) \cap C| \le (\alpha - 7\epsilon/8)|\Gamma(v) \cap C|$.

Since all the requirements of well-formedness allow deviations of order $\epsilon$ from expectation, by Chernoff bound
it is easy to show that the graph $G$ is well-formed with probability at least $1 - \exp(-\Omega(\alpha^2\epsilon^2\delta k))$.

Conditioned on $v \in C$ being selected, by Chernoff bounds we show the following statements
hold with high probability: (i) the subsampling gives a sample of less than 3 times the expectation.
(ii) If we choose $U$ to be the intersection of $\Gamma(v) \cap C$ and subsampled nodes, the number of edges
from most nodes in $\Gamma(v)$ to $U$ will be close to expectation. (iii) The symmetric difference of $V'$ and
$\Gamma(v) \cap C$ is at most $\epsilon|\Gamma(v) \cap C|/10$, because for any node in $\Gamma(v)$ to be in the symmetric difference
its number of edges to $U$ will have to be $\epsilon/2$ away from expectation. Chernoff bound shows the
probability of this is at most $\alpha\epsilon\delta\gamma/30d$, and $|\Gamma(v)| \le kd/\gamma$.

Since the symmetric difference is so small, and $G$ is well formed, there will be a gap in degree
for nodes in and outside $C$. For any $u \in \Gamma(v) \cap C$ the number of edges into $V'$ is at least
$(\alpha - \epsilon/8 - \epsilon/10)|\Gamma(v) \cap C|$; for any $u \notin C$, the number of edges into $V'$ is at most
$(\alpha - 7\epsilon/8 + \epsilon/10)|\Gamma(v) \cap C|$.
Hence setting the threshold at $\alpha - \epsilon/2$ suffices to distinguish the two cases. The set $U'$ is indeed a
subset of $\Gamma(v) \cap C$ of size at least a $(1 - \epsilon/10)$ fraction.

Finally, since the graph is well formed, any node $u \in C$ must have at least an $\alpha - \epsilon/8 - \epsilon/10$
fraction of edges to $U'$, and any node $u \notin C$ must have at most an $\alpha - 7\epsilon/8 + \epsilon/10$ fraction of edges
to $U'$. Again a threshold of $\alpha - \epsilon/2$ is enough to distinguish the two cases, and $U'' = C$.

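The running time depends on the size of the subsampled nodes, which is of order $O(p \cdot kd/\gamma) =
O(2d\log(30d/\alpha\epsilon\delta\gamma)/(\alpha\delta\epsilon^2\gamma))$. Enumerating all subsets of at most $2pk$ sampled nodes for each of the
$100n/\delta k$ starting nodes therefore gives a running time of $O(n \cdot (k/\alpha\gamma\delta) \cdot 2^{O(\log^2 d)})$. □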

2.2.1 Allowing Different Affinities

In the previous subsection we required edge probabilities $e_{u,v}$ to be exactly $\alpha$ if $u, v$ belong to the
same community, and this probability does not rise even when they belong to more than one
community. In real life these requirements may be too stringent. Here we define the Dense-Similar- 
Size Assumptions which relax these two requirements. In this model, the Dense-Community-Find- 
Algorithm may fail, and we give a new algorithm that, unfortunately, is less efficient. 



Dense-Similar-Size $(n, k, d, \alpha, \delta, \epsilon, \gamma)$ Assumptions.

Communities satisfy Assumptions 1-3 from Section 1.1 as well as the following:

* Each community $C \in \mathcal{C}$ has size between $\delta k$ and $k$ and is generated according to
Assumption 1 with affinities $p_u \ge \sqrt{\alpha}$.

* If $u, v$ are in more than one community then the edge has probability $e_{u,v}$ at least as large
as the maximum requirement ($p_u p_v$) of all communities that they lie in.



Theorem 4 Given a graph $G$ and a set of communities $\mathcal{C}$ consistent with the Dense-Similar-Size Model
with parameters $(n, k, d, \alpha, \delta, \epsilon, \gamma)$ where $d > 2$ and $k \gg \log n$, the Robust-Dense-Community-Find
Algorithm below outputs each community in $\mathcal{C}$ with probability at least $1 - \exp(-\Omega(\alpha^2\epsilon^2\delta k))$
over the randomness of $G$ and probability at least 2/3 over the randomness of the algorithm in time
$O(n \cdot (k/\alpha\delta\gamma)^{2\log(10/\epsilon)/\alpha + 1} 2^{O(\log^2 d)})$.

Proof: The previous algorithm may fail because in this model $\Gamma(v) \cap C$ is no longer a uniform
subset of $C$ and can be biased. Thus for a vertex $u$ the fraction of edges into the set $\Gamma(v) \cap C$
may be quite different from the fraction of edges into $C$. The idea of the algorithm is that for any
community $C$, there is always a set $S$ such that $\Gamma(S)$ contains a large ($\ge 1 - \epsilon/10$) fraction of $C$.
A uniform sample on $\Gamma(S) \cap C$ will be similar enough to a uniform sample on $C$, and the number
of edges into the sample will be close to the expectation. This allows us to get a set that is very close
to $\Gamma(S) \cap C$ and then extend it similarly as before.
Robust-Dense-Community-Find
1: Let $T = 2\log(10/\epsilon)/\alpha$.

2: Randomly choose $100n/\delta k$ starting nodes; for each starting node $v$ repeat the following:
3: for all sets of nodes $S \subseteq \Gamma(v)$ of size $T$ do

4: Let $G(\Gamma(S))$ be the induced subgraph of $S$ and their neighbors.

5: Subsample this subgraph by including each node with probability $p$. Pass
this round if this graph has size more than three times the expectation.
6: for all sets $U$ of nodes in the subsampled graph of size at most $2pk$ do
7: Let $V'$ be the set of nodes in $\Gamma(S)$ which are connected to at least an $\alpha - \epsilon/2$ fraction of
all nodes in $U$; let $G(V')$ be the induced subgraph on $V'$.
8: Let $U'$ be the set of nodes in $G(V')$ whose degree is at least $(\alpha - \epsilon/2)|V'|$.
9: Let $U''$ be the set of nodes that have more than an $\alpha - \epsilon/2$ fraction of edges into $U'$. Keep
$U''$ if it is an $(\alpha - \epsilon/8, \alpha - 7\epsilon/8)$ set.
10: end for
11: end for
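
The three thresholding steps (7 to 9) can be sketched as follows. This is only an illustration under assumptions not in the paper: the graph is a dict `adj` mapping each node to its set of neighbors, `candidates` plays the role of $\Gamma(S)$, `sample` plays the role of $U$, and the final check that $U''$ is an $(\alpha - \epsilon/8, \alpha - 7\epsilon/8)$ set is omitted.

```python
def fraction_into(adj, node, target):
    """Fraction of the nodes in `target` that `node` is adjacent to."""
    if not target:
        return 0.0
    return len(adj[node] & target) / len(target)


def extend_from_sample(adj, candidates, sample, alpha, eps):
    """Steps 7-9: grow a sampled set into a candidate community."""
    thr = alpha - eps / 2
    # Step 7: candidates adjacent to at least a thr fraction of the sample.
    v1 = {x for x in candidates if fraction_into(adj, x, sample) >= thr}
    # Step 8: nodes of v1 whose degree inside v1 is at least thr * |v1|.
    u1 = {x for x in v1 if len(adj[x] & v1) >= thr * len(v1)}
    # Step 9: every node of the graph with more than a thr fraction of edges into u1.
    return {x for x in adj if fraction_into(adj, x, u1) > thr}
```

Each step uses the same threshold $\alpha - \epsilon/2$; only the reference set changes, from the sample to $V'$ to $U'$.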

We call the graph $G$ well formed if the degree of each node and the number of edges from
any node to any community are within a $1 \pm \epsilon/8$ multiplicative factor of their expectations, and
for any $u, v \in C$ the size of their intersection in $C$, $|\Gamma(u) \cap \Gamma(v) \cap C|$, is within $1 \pm \epsilon/8$ of the
expectation. By concentration bounds and a union bound, the probability that $G$ is well formed is
at least $1 - \exp(-\Omega(\alpha^2\epsilon^2\delta k))$. We shall assume $G$ is well formed in the discussion below.

For any community $C$, when some $v \in C$ is the starting node, let $S$ be a random subset of $T$
nodes in $C \cap \Gamma(v)$. Since the size $|\Gamma(u) \cap \Gamma(v) \cap C|$ is concentrated for any $u, v \in C$, the probability
that none of these $T$ nodes is adjacent to $u$ is at most $(1 - \alpha + \epsilon/8)^T < \epsilon/10$. Thus the expected
size of $\Gamma(S) \cap C$ is at least $(1 - \epsilon/10)|C|$.
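
As a numeric sanity check of the bound $(1 - \alpha + \epsilon/8)^T < \epsilon/10$ with $T = 2\log(10/\epsilon)/\alpha$, the sketch below evaluates both sides for given parameters. It assumes the logarithm is natural, which the text does not state, and the function names are illustrative.

```python
import math


def sample_size(alpha, eps):
    """T = 2 log(10/eps) / alpha, with the natural logarithm (an assumption)."""
    return 2 * math.log(10 / eps) / alpha


def miss_probability_bound(alpha, eps):
    """Upper bound on the chance that none of the T sampled nodes is adjacent to u."""
    return (1 - alpha + eps / 8) ** sample_size(alpha, eps)
```

For example, `miss_probability_bound(0.5, 0.2)` can be compared directly against `0.2 / 10`.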

We fix a set $S$ such that $\Gamma(S) \cap C$ contains at least a $1 - \epsilon/10$ fraction of $C$, and show that
$C$ is found with good probability. With high probability the subsampling step returns a sample of
size less than 3 times the expectation. After sampling, fix $U$ to be the intersection of the subsampled
nodes and the community $C$. Then this $U$ is a uniform sample of the set $\Gamma(S) \cap C$. For any node
$v \in C$, the expected number of edges from $v$ to $U$ is at least $(\alpha - \epsilon/10 - \epsilon/8)|U|$; for any node
$v \notin C$, the expected number of edges from $v$ to $U$ is at most $(\alpha - 7\epsilon/8)|U|$. By a Chernoff bound
these values are $\epsilon/4$ away from expectation (and thus the node is in the symmetric difference of $V'$
and $C \cap \Gamma(S)$) with probability less than $\epsilon\gamma\delta/120Td$. The size of $\Gamma(S)$ is at most $2Tkd/\gamma$. With
high probability the symmetric difference (of $V'$ and $\Gamma(S) \cap C$) has size smaller than $\epsilon|\Gamma(S) \cap C|/20$.

Now since $V'$ is really close to $C \cap \Gamma(S)$, it is easy to check that for all vertices $u \in C \cap V'$, the
degree in $V'$ is larger than $(\alpha - \epsilon/2)|V'|$; for all $u \in V' \setminus C$ the degree in $V'$ is smaller. Thus setting
a threshold at $\alpha - \epsilon/2$ suffices to distinguish these two cases, and it follows that $U' = V' \cap C$. Now $U'$
is a large subset of $C$: all vertices $u \in C$ have more than $(\alpha - \epsilon/2)|U'|$ edges to $U'$ while all
vertices $u \notin C$ have fewer edges. Setting the threshold at $\alpha - \epsilon/2$ is again sufficient to distinguish
the two cases, and $U'' = C$.

Finally, the running time of the algorithm depends on the size of the subsampled node set, which
is at most $2Tkd/\gamma \cdot p$. Thus the algorithm runs in time
$O(n \cdot (kd/\gamma\epsilon\delta)^{2\log(10/\epsilon)/\alpha + 1} \cdot 2^{O(\log^2 d)})$. □

Notice that although the algorithm works for only a fixed value of $\alpha$, if the communities have
different densities we can also apply the algorithm with different $\alpha$ parameters to find all
communities.
3 When Communities may have Very Different Sizes



When communities have very different sizes, the parameter $\delta$ for our models in Section 2 can be
too small and the algorithms are not efficient. In this section we show we can relax the similar-size
requirement using a quasi-polynomial time algorithm. We can also find cliques of different sizes in
polynomial time with some additional assumptions.



3.1 Quasi-polynomial Time Algorithm for Communities of Different Sizes 

When we have quasi-polynomial time, we can find all communities that have at least constant
density just using Assumptions 1, 2 and 3. We only make sure that the minimum density we
want to find is a constant $\alpha_{\min}$, that is, each community satisfies Assumption 1 with smallest

Theorem 5 Given a graph $G$ satisfying the assumptions above with parameters $(n, k, d, \alpha_{\min}, \gamma)$,
if all communities are $(\alpha_C - \epsilon/8, \alpha_C - 7\epsilon/8)$ sets (which happens with high probability when the
sizes of the communities are not too small), the Any-Size-Dense-Community-Find algorithm will
output all communities in $\mathcal{C}$ in time $n^{100\log(kd/\gamma)/(\epsilon^2\alpha_{\min})}$.

Proof: When trying to apply previous ideas to this model, the difficulty is that communities have
very different sizes and sampling will not find small communities. To solve the problem we just
enumerate over all sets $S$ of size $T = 100\log(kd/\gamma)/(\epsilon^2\alpha_{\min})$ and think of all these points as chosen uniformly at
random from a certain community $C$. This $S$ will serve as the sampled points, and since it is large
we can apply a union bound to show we will make no error when extending it to a community.
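Any-Size-Dense-Community-Find

1: Let $T = 100\log(kd/\gamma)/(\epsilon^2\alpha_{\min})$.
2: for $\alpha = 1$ downto $\alpha_{\min}$ step $-\epsilon/4$ do
3: for all sets of nodes $S$ of size $T$ do

4: let $U$ be the set of all nodes that have more than an $\alpha - \epsilon/4$ fraction of edges to $S$.
5: keep $U$ if it is an $(\alpha, \alpha - \epsilon/2)$ set.
6: end for
7: end for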
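
The enumeration can be sketched on a toy graph as below. This is a brute-force illustration, not the authors' implementation, and it rests on stated assumptions: the graph is a dict `adj` from node to neighbor set; an $(\alpha, \beta)$ set is taken to mean that every member is connected to at least an $\alpha$ fraction of the set and every non-member to at most a $\beta$ fraction; and a node counts as connected to itself so that small cliques reach density 1. The sample size `T` is passed in directly rather than computed from the formula.

```python
from itertools import combinations


def frac(adj, v, target):
    """Fraction of `target` reached by v, counting v as connected to itself."""
    return len((adj[v] | {v}) & target) / len(target)


def is_ab_set(adj, cand, a, b):
    """Check the (a, b)-set condition under the convention stated above."""
    inside = all(frac(adj, v, cand) >= a for v in cand)
    outside = all(frac(adj, v, cand) <= b for v in adj if v not in cand)
    return inside and outside


def any_size_dense_find(adj, T, alpha_min, eps):
    found = []
    i = 0
    # Sweep alpha from 1 down to alpha_min in steps of eps/4.
    while 1.0 - i * eps / 4 >= alpha_min - 1e-12:
        alpha = 1.0 - i * eps / 4
        for S in combinations(sorted(adj), T):
            S = set(S)
            U = {v for v in adj if frac(adj, v, S) > alpha - eps / 4}
            if U and U not in found and is_ab_set(adj, U, alpha, alpha - eps / 2):
                found.append(U)
        i += 1
    return found
```

On a graph made of two cliques sharing one node, calling this with `alpha_min = 1` recovers both cliques, while the shared node on its own is rejected because its neighbors outside the set are fully connected to it.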

For each community $C$ with density $\alpha_C$, there must be a value of $\alpha$ in the loop (line 2) where
$\alpha_C > \alpha + \epsilon/8$ and $\alpha_C - \epsilon < \alpha - \epsilon/2 - \epsilon/8$ (because the step size is $\epsilon/4$). Assume this is the case,
and let $S$ be a uniformly random set of size $T$ in $C$. For any node $v$, if $v \in C$, then the expected
number of edges to the set $S$ is more than an $\alpha$ fraction; if $v \notin C$ the expected number of edges to the
set $S$ is less than an $\alpha - \epsilon/2$ fraction. The probability that the number of edges is an $\epsilon/4$ fraction away
from expectation is at most $\exp(-(\epsilon/4)^2 T \alpha_{\min}) < (kd/\gamma)^{-2}$. We only need to apply the union bound
on the nodes of $C$ and their neighbors, so the size is much smaller than $(kd/\gamma)^2$. By the union bound
the probability that the algorithm successfully finds $C$ is not 0. Since we are trying all possible sets
$S$ the algorithm will always find all the communities. □

Although the algorithm is for dense subgraphs, if we run it with $\alpha_{\min} = 1$, it will find all clique
communities of any size.



3.2 Polynomial Time Algorithm for Cliques of Different Sizes 

Now we try to improve the quasipolynomial time in Theorem 5 in the subcase when communities 
are cliques of different sizes. The idea will be to reduce the amount of sampling and exhaustive 
enumeration. To prove this works we need to make assumptions beyond 1, 2, 3. The difficulty is that
the solution can be highly nonunique and degenerate if communities are allowed to be too "similar." 
For example, suppose node w is not in a community C but is contained in other communities with 
large subsets of C. Should we now consider w to be part of C, since it does have edges to all (or 
most) of C? Our network model assumes such cases do not arise. 

Assumption 4) Communities are fairly distinct. For each node $u$ in community $C$, at least a
constant fraction, say $\beta$, of $C$ does not lie in any other community containing $u$. This is in accord
with the intuitive view of how communities arise: the interconnection structure provides utility to
its members above and beyond what existed before [12].

The next assumption is technical and perhaps was being assumed by the reader all along. 
Surprisingly, we did not need it until now. 

Assumption 5) Completeness Assumption. Any set that satisfies all the assumptions of a community
in the model is a community. (Also called the "Duck Assumption": "If it looks like a duck, quacks like
a duck, and walks like a duck, it's a duck.") This ensures the adversary can't satisfy Assumption
4 by just pretending that a certain set is not a community even though it looks like one.

Finally we want to strengthen Assumption 3 so that smaller communities are distinguishable 
in principle from the noise introduced by the ambient edges: 

Assumption 3') Every community a node $v$ belongs to has size at least $\gamma/d$ times the number of
ambient edges incident on the node $v$.

Theorem 6 Given a graph that satisfies all assumptions in this section with parameters
$(n, m, k, d, \beta, \epsilon, \gamma)$ where that $k > 3$, the Any-Size-Clique-Community-Find-Algorithm will
output all communities in $\mathcal{C}$ with probability at least 1 − in time
$O(n \log n \cdot (kd/\gamma)^{\log(2/\epsilon)/\beta + 1} \cdot 2^{O(\log^2 d)})$.

Proof: The main algorithmic difficulty will be that $\Gamma(v)$ for any node $v$ may contain cliques of
many different sizes. A subsample of $\Gamma(v)$ would be likely to hit large cliques quite often, but not
the smaller cliques. To solve this problem we try to find large cliques first. After cliques of size
greater than $l$ are found, we can henceforth ignore their edges, and proceed to find smaller cliques.

Another problem is that after removing all edges in the large cliques, the remaining neighborhood
of $v$ (called $\Gamma^-(v)$ in the algorithm) may not contain all nodes of a remaining clique. To solve
the problem the algorithm uses a set $S$ of size $T$. We should think of $S$ as a random set in the
community $C$; then by Assumption 4 and concentration bounds we know a large fraction of nodes
in $C$ are in $\Gamma^-(S)$.

Any-Size-Clique-Community-Find-Algorithm

1: Let $l = k$.

2: while $l > m$ do

3: Randomly choose $100 n \log n / l$ starting nodes; repeat the following for each starting node $v$:
4: Let $G(\Gamma(v))$ be the induced subgraph of $v$ and her neighbors.

5: Let $T = \log(2/\epsilon)/\beta$ and enumerate over all subsets of nodes of size $T$: $S \subseteq \Gamma(v)$ such that
$|S| = T$. (We are hoping that $S \subseteq C$ where $C$ is a community of size between $l/2$ and $l$.)
6: for all choices of $S$ do

7: Let $\Gamma^-(S)$ be the set of nodes that are connected with some node in $S$ using an edge
that does not belong to any of the communities already found by the algorithm. Consider
$G(\Gamma^-(S))$, the induced graph of the $T$ nodes in $S$ and their "out of community" neighbors.
We denote the size of this new induced graph $L$.

8: Subsample this subgraph by including each node with probability $p$. Pass
this round if this graph has size more than three times the expectation.
9: for all cliques $U$ in the subsampled graph of size at most $2pl$ do

10: Let $V'$ be the set of nodes in $\Gamma(v)$ which are connected to all nodes in $U$, and let $G(V')$
be the induced subgraph on $V'$.
11: Let $U'$ be the set of nodes in $G(V')$ whose degree in this subgraph is at least $(1 - \epsilon/4)|V'|$.
Greedily extend $U'$ to a maximal clique $U''$. Output $U''$ if it is a clique and for all $v \notin U''$,
$v$ is connected to at most a $1 - \epsilon$ fraction of $U''$.
12: end for
13: end for
14: Let $l = l/2$.
15: end while
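
Steps 10 and 11 can be sketched as follows. This is a minimal illustration under assumptions that are not in the paper: the graph is a dict `adj` from node to neighbor set, `candidates` stands for $\Gamma(v)$, `seed` stands for the sampled clique $U$, and degrees are counted with the node itself included so that the threshold is meaningful on small examples. The function name is hypothetical.

```python
def closed_degree(adj, x, group):
    """Number of nodes of `group` that x reaches, counting x itself."""
    return len((adj[x] | {x}) & group)


def clique_from_seed(adj, candidates, seed, eps):
    # Step 10: candidates connected to every seed node.
    v1 = {x for x in candidates if seed <= (adj[x] | {x})}
    # Step 11: keep nodes of high degree inside v1.
    u1 = {x for x in v1 if closed_degree(adj, x, v1) >= (1 - eps / 4) * len(v1)}
    clique = set(u1)
    if any(closed_degree(adj, x, clique) < len(clique) for x in clique):
        return None  # u1 is not a clique, so nothing is output
    # Greedy extension to a maximal clique.
    for x in sorted(adj):
        if x not in clique and clique <= adj[x]:
            clique.add(x)
    # Output test: no outside node may see more than a (1 - eps) fraction.
    for x in adj:
        if x not in clique and len(adj[x] & clique) > (1 - eps) * len(clique):
            return None
    return clique
```

The degree filter removes hangers-on that happen to be adjacent to the whole seed, and the final test rejects cliques that an outside node is almost part of.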

We show that if the algorithm correctly finds all cliques of size larger than $l$, then an iteration
of the while loop at line 2 will correctly find all communities with size between $l/2$ and $l$ with
probability 1 − The theorem then follows from a union bound.

Fix a community $C$ of size between $l/2$ and $l$, and assume a node $v \in C$ has already been
chosen at step 3. Let $S$ be a random subset of $\Gamma(v)$ of size $T$. By Assumption 4 we know that even
after ignoring all edges of larger communities, the number of remaining edges from any $u \in C$ to $C$ is at
least a $\beta$ fraction, thus $|\Gamma^-(u) \cap C| > \beta|C|$. A random set of size $T$ intersects any set of size $\beta|C|$
with probability $1 - \epsilon/2$, therefore in expectation $\Gamma^-(S)$ contains a $1 - \epsilon/2$ fraction of $C$. Since we
are enumerating over all sets $S$ we can now assume $S$ is such that $\Gamma^-(S)$ contains at least a $1 - \epsilon/2$
fraction of $C$.

Now, similarly to Theorem 1, it is easy to check that the following statements hold with high
probability: (i) the subsampling gives a sample of size at most thrice the expectation; (ii) for any
node $u \notin C$ there is a set of size at least $\epsilon|C|/2$ in $\Gamma^-(S) \cap C$ that is not connected to $u$. Suppose
we take $U$ to be the intersection of $C \cap \Gamma^-(S)$ and the subsampled nodes; then we have
(iii) $|V'| < |\Gamma^-(S) \cap C| + \epsilon|C|/15$. This is because by Assumption 3' the size of $\Gamma^-(S)$ is bounded
by $Tld/\gamma$ and each of the nodes outside $C$ has only $\exp(-p\epsilon|C|/2)$ probability of being in $V'$.

The last event implies $V'$ is really close to $\Gamma^-(S) \cap C$. Now in graph $G(V')$ each $u \in V' \setminus C$ has
degree at most $(1 - \epsilon + \epsilon/15)|C|$; each $u \in C$ has degree at least $(1 - \epsilon/2)|C|$. This gap enables the algorithm
to use a threshold of $(1 - \epsilon/4)|V'|$ to distinguish whether $u$ is in the community $C$ or not. The set
$U'$ will be equal to $\Gamma^-(S) \cap C$.

Finally by the Gap Assumption we know during the greedy extension of step 11, we can only 
include nodes in C and in fact will include all nodes in C. Therefore U" = C and the community 
C is found with high probability. 

The running time of the algorithm is dominated by the round when $l = k$. In that round
the size of the subsampled set is at most $O(p \cdot Tkd/\gamma)$, and we want to find a set
of size $2pk$. Thus the running time is

$O(n \log n \cdot (kd/\gamma)^{\log(2/\epsilon)/\beta + 1} \cdot 2^{O(\log^2 d)})$.

□

We leave it as an open problem to identify a reasonable set of assumptions that allow polynomial
time when communities are dense subgraphs. The problem is that the "duck assumption" is not
well defined: we know what a clique looks like, but it is hard to tell whether a subgraph looks
like a community generated according to Assumption 1 when there are overlapping communities
and ambient edges. We could try to make a stronger duck assumption by assuming every large
$(\alpha, \alpha - \epsilon)$-set is a community, and then a similar algorithm would be able to find all dense communities
in polynomial time. But this is not as reasonable as our other assumptions: consider two $(\alpha, \alpha - \epsilon)$-sets
$C_1$ and $C_2$ of size $2k/3$ whose intersection has size $k/3$; then it is quite likely that their union
$C_1 \cup C_2$ is an $(\alpha, \alpha - \epsilon)$ set, but we do not consider this set a community.
4 Relaxing the Assumptions



4.1 Relaxing Assumption 1 

Assumption 1 states that each community's edges are generated according to an expected degree
model. In this section we note that the algorithms and proofs of Theorems 4 and 5
actually apply to a more general setting.

We first note that we can substantially relax the Dense-Similar-Size Model by replacing As- 
sumption 1 with the following two requirements: 

Concentration: the number of edges from any node $u$ to any community $C$ is concentrated around
the expectation, that is, $\Pr[|\Gamma(u) \cap C| \notin (1 \pm \epsilon) E[|\Gamma(u) \cap C|]] < \exp(-\epsilon^2 E[|\Gamma(u) \cap C|])$, and the
degree of each node is concentrated similarly.

$(\alpha, \epsilon)$-Regularity: for all $u, v \in C$,
$\Pr[|\Gamma(u) \cap \Gamma(v) \cap C| < (1 - \epsilon) E[\alpha|\Gamma(v) \cap C|]] < \exp(-\epsilon^2 E[\alpha|\Gamma(v) \cap C|])$.
These properties do not require full independence, but only limited independence, which could
be satisfied, for example, by the configuration model [23] which generates a multigraph with 
(nearly) any particular preassigned degree distribution. This definition could also accommodate 
additional structure that introduces dependencies among the edges as long as there is still sufficient 
independence to satisfy Concentration and Regularity. Consider, for instance, the disjoint union of
two equal-sized cliques with a random bipartite graph of density $\beta$ between them. This is $\alpha$-regular
for any sufficiently large $\beta$. Thus communities can be much more clumpy than in the expected degree model.

Remark 1 Theorem 4 still holds with the same proof after replacing Assumption 1 with Concentration
and $(\alpha, \epsilon)$-regularity.

Remark 2 Theorem 5 still holds with the same proof after replacing Assumption 1 with Concen- 
tration. 

4.2 Relaxing Assumption 2 the Gap Assumption

Though plausible, the Gap Assumption may not exactly hold in a real-life graph since there may 
be nodes that fall in the "gap." Our algorithm needs to still return sensible answers. Now we 
argue that our algorithms in Theorem 1, and Theorem 3 produce sensible answers even when this 
happens. We use the Clique-Community-Find-Algorithm (Theorem 1) to illustrate. 

Of course in this setting we cannot hope to return the exact communities. Instead the algorithm
will return some $C'$ that contains more than a $1 - \epsilon$ fraction of $C$ and has density more than $1 - \epsilon$.

Theorem 7 If $G$ is a graph that satisfies Assumptions 1 and 3, and each community is a clique
that has size between $\delta k$ and $k$ where $\delta > 0$ is some constant and $k$ is arbitrary but known to the
algorithm, then the Clique-Community-Find-Algorithm can be adapted so that for any community
$C$, the algorithm finds a set $C'$ such that $|C' \cap C| > (1 - \epsilon)|C|$ and for each $v \in C'$, the number of
edges to $C'$ is at least $(1 - \epsilon)|C'|$.

Proof: The idea is to run the Clique-Community-Find-Algorithm as before. Once we get $V'$
corresponding to setting $U = S \cap C$ (recall that under the assumptions of Theorem 1 this $V'$
contains $C$ and has only a small number of nodes outside $C$ relative to $|C|$), we know with high probability $V'$ consists of 3
parts: the community $C$ itself, some set of nodes that have more than a $1 - \epsilon$ fraction of edges to $C$, and some
set of nodes that have no more than a $1 - \epsilon$ fraction of edges to $C$. Of these 3 parts, the first is what we
want, the third is very small by the proof of Theorem 1, and so only the second part worries
us. Among these three parts, the nodes in $C$ should tend to have the largest degrees, so we will use
the degree of these nodes to identify them. For any node $v$ in $V'$, we call $|\Gamma(v) \cap V'|/|V'|$ the density
of $v$. The idea will be to run Clique-Community-Find-Algorithm with some parameter $\epsilon'$ that is
somewhat smaller than $\epsilon$ to get $V'$. Then repeatedly remove nodes from $V'$ that have density less
than $1 - \epsilon/2$ until the density of all remaining nodes is more than $1 - \epsilon$. We will show by a simple
calculation that not many nodes in $C$ will be removed. The details are as follows:

1. Run Clique-Community-Find-Algorithm with $\epsilon' = \epsilon^2/(6\log(d/\delta\gamma))$.

2. Once the algorithm has produced the set $V'$, repeatedly remove all nodes of density less than
$(1 - \epsilon/2)$ until for every $v \in V'$ the density is at least $(1 - \epsilon)$.
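
The pruning loop of step 2 can be sketched as below. It assumes, for illustration only, that the graph is a dict `adj` from node to neighbor set, that the density of a node is the fraction of the current set it is connected to, and that a node counts as connected to itself. The function name is hypothetical.

```python
def prune_low_density(adj, nodes, eps):
    """Remove low-density nodes until every survivor has density >= 1 - eps."""
    current = set(nodes)

    def density(v):
        # Fraction of the current set that v reaches, counting v itself.
        return len((adj[v] | {v}) & current) / len(current)

    while current and min(density(v) for v in current) < 1 - eps:
        # One round: drop every node whose density is below 1 - eps/2.
        low = {v for v in current if density(v) < 1 - eps / 2}
        current -= low
    return current
```

The loop stops as soon as all remaining nodes meet the weaker bound $1 - \epsilon$, even though each round removes nodes against the stricter bound $1 - \epsilon/2$; this gap is what limits the number of rounds in the analysis.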

Let $V' = C \cup H \cup W$, where $C$ is the community, $H$ are the nodes that are connected to more
than a $1 - \epsilon'$ fraction of $C$ and $W$ are the nodes that are connected to at most a $1 - \epsilon'$ fraction of $C$.
By the proof of Theorem 1, $|W| < \epsilon'|C|$; thus it is very small and can be ignored in the computation
below.

The size of $V'$ is at most $kd/\gamma$, because each node in it is adjacent to the node $v$ in the algorithm.
We shall show that (i) each iteration removes at most an $\epsilon/\log(d/\delta\gamma)$ fraction of nodes in $C$, and
(ii) if any iteration removes less than half of the nodes, each node in the remaining graph will be
connected to at least a $1 - \epsilon$ fraction of nodes. Claim (ii) implies that we will have at most $\log(d/\delta\gamma)$
iterations, and then Claim (i) implies that at most an $\epsilon$ fraction of nodes in $C$ are removed.

For (i), notice that as long as less than half of $C$ is removed, the fraction of edges from any
node in $H$ to the remaining part of $C$ is at least $1 - 2\epsilon'$. Thus the average density of nodes in $C$ is
at least $1 - 3\epsilon'$. By Markov's inequality the fraction of nodes that have density less than $1 - \epsilon/2$
is at most $3\epsilon'/(\epsilon/2) = \epsilon/\log(d/\delta\gamma)$.

For (ii), notice that all remaining nodes had density 1 — e/2, if less than half of the nodes are 
removed, their density should still be larger than 1 — e. 

Notice that since the $\epsilon'$ used in Clique-Community-Find-Algorithm is now $\epsilon^2/(6\log(d/\delta\gamma))$,
the running time of the algorithm will be 0{nk/5'y2^^^°^ '^^). □ 

Similar ideas can be used to relax the gap assumption in Theorem 3. The main difficulty of
applying the same argument is that the edges not from community membership can be adversarially
chosen. However in real-life graphs this is not likely to happen: if two people do not share any
community, the probability that they know each other should be lower. If for all edges $(u, v)$ we
have the probability $e_{u,v} \le \alpha$, then for any community $C$ the following algorithm can always find
some set $C'$ with density at least $\alpha - \epsilon$ that contains at least a $1 - \epsilon$ fraction of $C$:

1. Run the Robust-Dense-Community-Find algorithm with $\epsilon' = \epsilon^2/(10\log(d/\delta\gamma))$.

2. Once the algorithm has produced the set $V'$, repeatedly remove all nodes of density less than
$\alpha - \epsilon/2$ until the density of every node is at least $\alpha - \epsilon$.

The proof is very similar to that of Theorem 7. First we focus only on the expected degree of nodes.
Since $e_{u,v} \le \alpha$ we can normalize these probabilities by multiplying by $1/\alpha$. Now $e_{u,v} = 1$ for nodes
in a community, and $e_{u,v} \le 1$ for all pairs $u, v$. Thus the same argument as in Theorem 7 shows that
the algorithm works if we are given the true values of $e_{u,v}$. Then we argue that the algorithm
should also work even if we are just given a random graph $G$, because the algorithm only uses the
degrees of nodes in various sets and they all concentrate around their expectations. There are some
technicalities here when many nodes have expected degree very close to the threshold we are setting,
which can be resolved if we choose a random threshold between $\alpha - \epsilon/2$ and $\alpha - 0.4\epsilon$.
5 Sparser Communities



In previous sections we have been talking about communities as dense graphs. This is natural when 
the community considered is small and people inside are closely related, such as people in the same 
year and department in a university. However this may not be true for larger communities: if we 
consider all students in a large university, or even all computer scientists, then it is unlikely that 
every person knows a constant fraction of other people in the community. In this section we show 
how our ideas can be applied to communities that are not so dense. 

Consider a simple set of assumptions ("Sparse") where the affinities within a community are small.
That is, two people in the same community of size $k$ know each other only with probability
$\Omega(1/\sqrt{k})$. We assume the network satisfies Assumption 1 from Section 1.1 as well as the following:
(1) Each community has size $k$, and for any two nodes $u, v$ in the same community the probability
$e_{u,v} = B/\sqrt{k}$ where $B$ is a constant larger than 10. (2) There are no ambient edges. (3) The
intersection of any two communities $C \cap C'$ has size at most $k/20d^2$.

Notice that we do not need to require the gap assumption nor the duck assumption here because 
they are both implied by property 3 and the fact that every edge is in some community. 

Instead of giving an algorithm to find communities under the "Sparse" assumptions, we show 
that it can be transformed to a graph G' such that G' , C fulfill the Dense-Similar-Size Assumptions. 
Then we can directly apply the Robust-Dense-Community-Find algorithm of Theorem 4 to find 
the communities. 

Theorem 8 Let a graph $G$ and a set of communities $\mathcal{C}$ be consistent with the Sparse Model.
Construct a graph $G'$ on the same set of nodes, where $u, v$ have an edge in $G'$ if and only if they have
at least $B^2/2$ length-2 paths in $G$. Then the pair $(G', \mathcal{C})$ is consistent with the Dense-Similar-Size
Assumptions with parameters $(n, k, d, \alpha, \delta, \epsilon, \gamma) = (n, k, d, 0.9, 1, 0.6, 1/3d)$.
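
The construction of $G'$ amounts to thresholding the number of common neighbors, since each common neighbor of $u$ and $v$ gives exactly one length-2 path between them. A minimal sketch, assuming the graph is a dict `adj` from node to neighbor set and leaving the path-count threshold as a caller-supplied parameter (the function name is hypothetical):

```python
def two_path_graph(adj, threshold):
    """Connect u and v when they have at least `threshold` common neighbors."""
    nodes = sorted(adj)
    new_adj = {u: set() for u in nodes}
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            # Common neighbors of u and v are exactly the length-2 paths u-w-v.
            if len(adj[u] & adj[v]) >= threshold:
                new_adj[u].add(v)
                new_adj[v].add(u)
    return new_adj
```

This quadratic loop is only meant for small examples; on a 4-cycle with threshold 2 it connects exactly the two pairs of opposite corners.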

Proof: We rely on the relaxation of the Dense-Similar-Size model in Section 4.1 using the con- 
centration and (a, e)-regularity. 

For concentration, focus on one community $C$. Notice that once we fix all the edges adjacent to
$u$, the events that $v$ has at least $B^2/2$ length-2 paths to $u$ in $G$ are independent for different $v$'s
in the same community. This is because the number of length-2 paths is completely determined
by the number of edges from $v$ to $\Gamma_G(u)$, and these are disjoint sets for different $v$'s. Moreover,
by symmetry the probability only depends on the number of edges adjacent to $u$ in community
$C$. Thus once we fix the degree of $u$ inside $C$ in graph $G$, all the edges $(u, v)$ where $v \in C$ are
independent and they satisfy a Chernoff bound. The degree of $u$ itself is also concentrated.

For $(\alpha, \epsilon)$-regularity, we show that for any $u, v \in C$, the size of their intersection within $C$,
$|\Gamma(u) \cap \Gamma(v) \cap C|$, is also concentrated. Just consider randomly choosing $\Gamma(u)$ and $\Gamma(v)$; with high
probability both their sizes and the size of their intersection are close to the expectation. In this
case whether some node $w$ has many length-2 paths to $u$ or $v$ is also independent (because the
relevant edges are disjoint for different $w$). Chernoff bounds imply the concentration.

For the edge probability $\alpha$, the expected number of length-2 paths between $u$ and $v$ in the
same community $C$ is at least $|C| \cdot (B/\sqrt{k})^2 = B^2$; by a Chernoff bound the probability that the
number drops to half of the expectation is smaller than 0.1 when $B > 10$.

For Assumption 3 in Section 1.1, for each node $v$, if it is in $d' \le d$ communities, by the calculation
above the expected number of community edges in $G'$ is at least $kd' \cdot 0.9 \cdot 0.9 > 0.8kd'$ (here the first
0.9 is the probability of an edge within the community, and the second 0.9 is because the communities
may overlap; however, by property 3 the overlap must be small). The expected number of length-2
paths starting from $v$ in $G$ is at most $kd' \cdot (B/\sqrt{k}) \times kd \cdot (B/\sqrt{k}) = kd'dB^2$, thus the expected
number of edges of $v$ in $G'$ must be smaller than a $2/B^2$ fraction of this, which is $2kd'd$. Since
$0.8kd'/2kd'd > 1/3d$, the number of ambient edges is small.

Finally, we would like to show the gap assumption: if $u \notin C$, the expected number of edges from
$u$ to $C$ in $G'$ is small. To do that we only need to show the expected number of length-2 paths from $u$
to $C$ in $G$ is small. We divide the length-2 paths from $u$ to $C$ into two cases: those that enter $C$ at the
first step and those that enter at the second step. For the first type, the expected size of $\Gamma(u) \cap C$ is
only $k/20d^2 \cdot d \cdot B/\sqrt{k} = B\sqrt{k}/20d$, thus the number of length-2 paths from $u$ to $C$ where the first step
is inside $C$ is at most $B\sqrt{k}/20d \cdot k \cdot B/\sqrt{k} = B^2k/20d$. For the second type, in the second step each
node has at most $k/20d^2 \cdot d \cdot B/\sqrt{k} = B\sqrt{k}/20d$ expected edges to $C$, thus the total expected
number of length-2 paths is at most $dk \cdot B/\sqrt{k} \cdot B\sqrt{k}/20d = B^2k/20$. Combining the two cases, we
know the expected number of length-2 paths from $u$ to $C$ is at most $B^2k/20 + B^2k/20d < B^2k/10$,
thus the expected number of edges in $G'$ is at most $(B^2k/10) \cdot 2/B^2 = 0.2k$, a fraction $0.2 < \alpha - \epsilon$ of the community size.

□ 

6 Conclusion 

We introduced a framework for rigorously thinking about community structure that allows (a) 
overlapping communities (b) includes well-studied notions such as cliques and dense subgraphs as 
subcases and (c) yet allows efficient algorithms for recovering the communities. Our assumptions lie 
between worst-case and average-case, are based on a long line of research, and we suspect they hold 
in many generative models. Our sampling-based techniques infer global structure (socio-centric 
analysis) from the neighborhood of vertices (ego-centric analysis). This local versus global frame- 
work, familiar in computer science, may be useful in other settings in sociology, especially because 
ego-centric networks are empirically observed to be dense and thus amenable to our techniques. 
We think our techniques should meld well with existing heuristics and plan to do a performance 
study on real-world data, and also to test the validity of our assumptions. 

Weakening our assumptions is another promising direction, and we made a start in Section 4. 
The Gap Assumption (Assumption 2) makes intuitive sense but probably cannot be guaranteed for 
all network nodes (e.g., there will be an occasional node that knows more community members than
some particular member, yet is not a community member). Our use of the expected degree model 
for the intracommunity edges (Assumption 1) can be weakened somewhat, but it still is a static 
model. 

Arguably community evolution is a dynamic process that results in a more intricate, and possibly 
hierarchical structure. Researchers have started considering two-step models. For example the first 
step could generate an initial graph according to our assumptions and in the second step each 
node connects to each neighbor of a neighbor with some small probability. Making these models 
amenable to efficient community-detection is a good open problem. 